library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Getting Started with ggplot

We will take another look at the mpg data. glimpse() is another useful way to get a quick look at a dataset: it lists every column along with its type and first few values.

glimpse(mpg)
## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
## $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
## $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
## $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…

Your first ggplot object. Hopefully this is familiar.

The code below creates a scatter plot with ggplot2. Let’s break it down:

ggplot(data = mpg):
    This call creates the plot object and starts the ggplot2 plotting system.
    data = mpg specifies that the data for the plot will come from the mpg dataset, which ships with the ggplot2 package and contains information about fuel economy for different cars.

+ geom_point(mapping = aes(x = displ, y = hwy)):
    + is used to add layers to the ggplot object.
    geom_point() specifies that we want to create a scatter plot.
    mapping = aes(x = displ, y = hwy) defines how the data should be mapped to the plot:
        x = displ: The displ variable (engine displacement) will be plotted on the x-axis.
        y = hwy: The hwy variable (highway miles per gallon) will be plotted on the y-axis.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

We can simplify this code. ggplot() takes the data as its first argument, and aes() takes x and y as its first two arguments, so the argument names can be dropped.

ggplot(mpg) + 
  geom_point(aes(displ, hwy))
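
As a quick check that the short form is equivalent to the long form, the sketch below builds both versions and compares the data ggplot2 computes for the point layer. layer_data() is a ggplot2 helper for inspecting a built layer; the object names p_full, p_short, and same are only illustrative.

```r
library(ggplot2)

# Fully named arguments
p_full <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# Positional arguments: data first, then x and y inside aes()
p_short <- ggplot(mpg) +
  geom_point(aes(displ, hwy))

# Compare the computed layer data of the two plots
same <- isTRUE(all.equal(layer_data(p_full), layer_data(p_short)))
same
```

Naming the arguments is still worth doing whenever you pass anything beyond data, x, and y, because the later positions are easy to mix up.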

We can colour-code the points based on another variable (e.g., class of the car).

Assigning colour to class:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

Task 1

You can map a variable to other aesthetics inside aes() to customise the appearance of the points in your scatter plot beyond just colour.

Try mapping class to size, alpha, or shape instead of color.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, shape = class))
## Warning: The shape palette can deal with a maximum of 6 discrete values because more
## than 6 becomes difficult to discriminate
## ℹ you have requested 7 values. Consider specifying shapes manually if you need
##   that many have them.
## Warning: Removed 62 rows containing missing values or values outside the scale range
## (`geom_point()`).
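
The warning appears because the default shape palette has six shapes while class has seven levels, so the seventh level (suv, 62 rows) gets no shape and is dropped. One way to follow the warning's advice, shown here as a minimal sketch, is to supply the shapes yourself with scale_shape_manual(). The shape codes 0:6 are an arbitrary choice for illustration, and p_shapes and n_missing are illustrative names.

```r
library(ggplot2)

# Supply one shape code per class level (7 levels, so 7 codes).
# The codes 0:6 are an arbitrary choice for illustration.
p_shapes <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class)) +
  scale_shape_manual(values = 0:6)

# Count rows that ended up without a shape
n_missing <- sum(is.na(layer_data(p_shapes)$shape))
n_missing
```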

Setting a value manually for all points. Note that the color argument is inside geom_point() but outside aes() (the aesthetics function). This sets every point to the same colour.

ggplot(data = mpg) + 
  geom_point(color = "blue", mapping = aes(x = displ, y = hwy))

Try changing the shape as well. Search for ‘R point shapes’ to see what is available.

ggplot(data = mpg) + 
  geom_point(color = "blue", shape = 4, mapping = aes(x = displ, y = hwy))

ggplot(data = mpg) + 
  geom_point(size = 10, color = "springgreen4", alpha = 0.1, mapping = aes(x = displ, y = hwy))

ggplot works in layers. Each geom added with + is drawn on top of the ones before it.

Task 2

Try adding one more layer of size 1 black points.

ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point(size = 20, color = 'coral4') +
  geom_point(size = 10, color = 'chocolate2') +
  geom_point(size = 5, color = 'darkgoldenrod1') +
  geom_point(size = 1, color = 'black')

Task 3

Why doesn’t this make the points blue? Fix this problem.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

ggplot(data = mpg) + 
  geom_point(color = 'blue', mapping = aes(x = displ, y = hwy))
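
To see why the first version fails, it can help to look at the colours ggplot2 actually assigns. In the sketch below (object names are illustrative), "blue" inside aes() is treated as a one-level variable and passed through the default colour scale, whereas "blue" outside aes() is used literally.

```r
library(ggplot2)

# "blue" inside aes(): treated as a one-level variable and mapped
# through the default colour scale
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

# "blue" outside aes(): set directly as the colour of every point
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

mapped_colour <- unique(layer_data(p_mapped)$colour)
set_colour <- unique(layer_data(p_set)$colour)
mapped_colour  # a colour chosen by the scale, not "blue"
set_colour     # the literal value "blue"
```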

Which of the mpg variables are categorical?

What happens when you map a “continuous” variable to a “categorical” aesthetic? What happens when you map a “categorical” variable to a “continuous” aesthetic?

Here’s a categorical variable mapped to the x-axis (a typically continuous aesthetic).

ggplot(mpg) +
  geom_point(aes(x = class, y = hwy))

Here’s a continuous variable mapped to shape. What happens?

ggplot(mpg) +
  geom_point(aes(x = cty, y = hwy, shape = displ))
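
The sketch below captures what happens without stopping the session: creating the plot object succeeds, and the error only appears when ggplot2 builds (or prints) the plot. The binned version at the end is one possible workaround, added here as an illustration rather than as part of the exercise; cut_number() is a ggplot2 helper, and three bins is an arbitrary choice.

```r
library(ggplot2)

# Creating the object works; no scale has been chosen yet
p_bad <- ggplot(mpg) +
  geom_point(aes(x = cty, y = hwy, shape = displ))

# Building the plot triggers the error, which tryCatch() captures
result <- tryCatch(ggplot_build(p_bad), error = function(e) e)
inherits(result, "error")

# Possible workaround (illustrative): bin displ into 3 groups so that
# the variable mapped to shape is discrete
p_binned <- ggplot(mpg) +
  geom_point(aes(x = cty, y = hwy, shape = cut_number(displ, 3)))
```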

What happens when we map the same variable to lots of aesthetics?

ggplot(mpg) +
  geom_point(aes(x = hwy, y = hwy, color = hwy, alpha = hwy))

What does stroke do?

?geom_point

Try playing with some of the parameters:

ggplot(mpg, aes(hwy, displ)) +
  geom_point(shape = 23, colour = "slateblue4", fill = "slateblue1", size = 3, stroke = 2)

Task 4

What happens when you map an aesthetic to something other than a variable name?

Try setting color to displ < 5

ggplot(mpg) + 
  geom_point(aes(x = hwy, y = displ, color = displ < 5))

It maps the logical vector (displ < 5) to the color aesthetic. ggplot2 evaluates the expression for every row and treats the TRUE/FALSE result as a discrete variable with two levels, so the points fall into two groups:

  • Points where displ is less than 5 (TRUE) get one colour from the default discrete colour scale.
  • Points where displ is greater than or equal to 5 (FALSE) get the other colour.
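
The two groups can be confirmed directly. This short sketch (with illustrative object names) counts how many cars fall on each side of the threshold and checks how many distinct colours end up in the built layer.

```r
library(ggplot2)

p_logical <- ggplot(mpg) +
  geom_point(aes(x = hwy, y = displ, color = displ < 5))

# The expression is evaluated row by row, giving one TRUE/FALSE per car
group_sizes <- table(mpg$displ < 5)
group_sizes

# Number of distinct colours assigned in the built layer
n_colours <- length(unique(layer_data(p_logical)$colour))
n_colours
```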

Try using facets instead of color:

Facets in ggplot2 are a powerful way to create multiple subplots within a single plot, allowing you to explore how the relationship between variables changes across different groups or categories.

How facet_wrap() works:

facet_wrap(~ class, nrow = 2):
    facet_wrap() is the function that creates the facets.
    ~ class specifies that the data should be divided into separate subplots based on the values in the class variable.
    nrow = 2 controls the layout of the subplots. In this case, the subplots will be arranged in two rows.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)

Recreating Gapminder

gm <- read.csv("gapminder.csv", header=TRUE)

What are the variables in this data, and what are their types?

glimpse(gm)
## Rows: 1,704
## Columns: 6
## $ country   <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ continent <chr> "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asi…
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …

Task 5

How do we recreate the Hans Rosling image? Which attributes should map on to which aesthetic?

HINT: plot gdpPercap on a log scale by adding scale_x_log10() to the chart.

ggplot(gm)+
  geom_point(aes(x=gdpPercap, y=lifeExp, color=continent, size=pop, alpha=year))+
  scale_x_log10()

ggplot(gm)+
  geom_point(aes(x=gdpPercap, y=lifeExp, color=continent, size=pop))+
  facet_wrap(~year) +
  scale_x_log10()

WE WILL STOP HERE FOR TODAY.


YOU CAN WORK THROUGH THE REST OF THE MATERIALS IN YOUR OWN TIME.

Histograms and Density

probly <- read_csv('probly.csv')
## Rows: 46 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (17): Almost Certainly, Highly Likely, Very Good Chance, Probable, Likel...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Click on the probly dataset in your environment to view it. Try a few of the different words from the column headers:

ggplot(probly, aes(x = `About Even`)) +
  geom_histogram(binwidth = 1)

Task 6

How many bins is too many or too few? Play around with the number of bins.

ggplot(probly, aes(x = `About Even`)) +
  geom_histogram(bins = 10)

What about sizing by binwidth? Play around with the binwidth.

ggplot(probly, aes(x = `About Even`)) +
  geom_histogram(binwidth = 2)
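
The two arguments control the same thing from opposite directions. The sketch below uses the built-in mpg data rather than probly (so that it runs without the CSV file) and inspects the bars ggplot2 computes; the object names are illustrative.

```r
library(ggplot2)

# bins asks for a number of bars; their width follows from the data range
p_bins <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(bins = 10)

# binwidth fixes the width of each bar; the number of bars follows
p_width <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2)

# Width of each bar in the two histograms
widths_bins <- with(layer_data(p_bins), xmax - xmin)
widths_width <- with(layer_data(p_width), xmax - xmin)
unique(round(widths_bins, 6))
unique(round(widths_width, 6))
```

Because binwidth is expressed in the units of the variable, it is often the easier of the two to reason about when comparing histograms of the same measure.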

Density plots

ggplot(probly, aes(x = `About Even`)) +
  geom_density()

We have to reshape the data to plot all of the phrases at once. Remember we talked briefly about this in the Week 1 lecture? Inspect the ‘tall’ dataset once you have created it.

probly_tall <- gather(probly, word, value)
glimpse(probly_tall)
## Rows: 782
## Columns: 2
## $ word  <chr> "Almost Certainly", "Almost Certainly", "Almost Certainly", "Alm…
## $ value <dbl> 95.0, 95.0, 95.0, 95.0, 98.0, 95.0, 85.0, 97.0, 95.0, 90.0, 90.0…
ggplot(probly_tall) +
  geom_point(aes(x = value, y = word))
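
gather() still works, but the tidyr documentation marks it as superseded by pivot_longer(). The sketch below shows the equivalent call on a tiny made-up table: the two column names are borrowed from probly, but the values are invented for illustration.

```r
library(tidyr)
library(dplyr)

# A tiny made-up stand-in for probly: one column per phrase
toy <- tibble(
  `Almost Certainly` = c(95, 98, 90),
  `Highly Likely`    = c(80, 85, 90)
)

# The approach used above
tall_gather <- gather(toy, word, value)

# The pivot_longer() equivalent
tall_pivot <- pivot_longer(toy, cols = everything(),
                           names_to = "word", values_to = "value")

# Same rows, possibly in a different order, so sort before comparing
same_rows <- isTRUE(all.equal(
  as.data.frame(arrange(tall_gather, word, value)),
  as.data.frame(arrange(tall_pivot, word, value))
))
same_rows
```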

Overlapping points using categorical data:

We can visualise overlapping points using ‘alpha’ transparency:

ggplot(probly_tall) +
  geom_point(alpha = 0.3, aes(x = value, y = word))

Or by using the position_jitter() function to shift the points a little:

ggplot(probly_tall) +
  geom_point(position = position_jitter(height = 0.4),
             aes(x = value, y = word))

Sorting things:

The fct_reorder() function is from the forcats package, which is part of the tidyverse. Here it reorders the levels of word by the mean of value within each word group (without .fun = mean it would use the median).

Note: data types matter. When we created the probly_tall dataset, the ‘word’ column (based on the column headers from ‘probly’) was stored as a character vector, as the <chr> in the glimpse() output shows. fct_reorder() accepts a character vector and returns a factor, a categorical variable with defined levels, so no separate as.factor() step is needed here.

probly_tall <- probly_tall %>% 
  mutate(word = fct_reorder(.f = word, .x = value, .fun = mean))
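
To see the effect of .fun on a small scale, here is a sketch with made-up values in which ordering by the mean and ordering by the default (the median) disagree. The vectors and object names are illustrative only.

```r
library(forcats)

# Made-up values: "a" has mean 11 and median 2, "b" has mean 6 and median 6
word  <- factor(c("a", "a", "a", "b", "b", "b"))
value <- c(1, 2, 30, 5, 6, 7)

by_mean   <- levels(fct_reorder(.f = word, .x = value, .fun = mean))
by_median <- levels(fct_reorder(.f = word, .x = value))

by_mean    # "b" "a": the outlier 30 pulls the mean of "a" above "b"
by_median  # "a" "b": the median ignores the outlier
```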

This is definitely easier to view:

ggplot(probly_tall) +
  geom_point(position = position_jitter(height = 0.2), 
             aes(x = value, y = word))

Ridges.

ggridges is an R package that provides a function called geom_density_ridges() for creating ridgeline plots.

What are Ridgeline Plots?

Ridgeline plots are a visually appealing way to compare the distributions of a continuous variable across different groups or categories.   

They are similar to density plots, but instead of overlapping them, they are stacked vertically, creating a series of “ridges” that resemble a mountain range.

library(ggridges)
ggplot(probly_tall) +
  geom_density_ridges(aes(x = value, y = word), fill = 'lightblue', color = 'white')
## Picking joint bandwidth of 3.43

ggplot(probly_tall) +
  geom_density_ridges(aes(x = value, y = word))
## Picking joint bandwidth of 3.43

Pie and Proportions

Data from https://ncses.nsf.gov/pubs/nsf19301/data.

load('gendphd.rda')
glimpse(gendphd)
## Rows: 126
## Columns: 8
## $ field      <fct> all, all, all, all, all, all, all, all, all, all, all, all,…
## $ gend       <chr> "female", "female", "female", "female", "female", "female",…
## $ year       <chr> "1987", "1992", "1997", "2002", "2007", "2012", "2017", "19…
## $ count      <dbl> 11431, 14435, 17242, 18140, 21904, 23527, 25495, 20934, 242…
## $ perc       <dbl> 35.3, 37.3, 40.9, 45.4, 45.5, 46.2, 46.7, 64.7, 62.7, 59.1,…
## $ label      <chr> "All fields", "All fields", "All fields", "All fields", "Al…
## $ year_num   <dbl> 1987, 1992, 1997, 2002, 2007, 2012, 2017, 1987, 1992, 1997,…
## $ label_wrap <chr> "All fields", "All fields", "All fields", "All fields", "Al…
g <- gendphd %>% 
  filter(year == '1987', label == 'Life sciences') 

I don’t do pie charts often, so I actually pulled the examples from here.

ggplot(g) +
  geom_bar(aes(x = '', y = perc, fill = gend), stat = 'identity') +
  coord_polar('y', start = 0)

But let’s leave pie charts behind and use a single stacked bar instead.

ggplot(g) +
  geom_bar(aes(x = 1, y = perc, fill = gend), stat = 'identity')

Add some labels and clean it up.

ggplot(g) +
  geom_bar(aes(x = 1, y = perc, fill = gend), stat = 'identity') +
  scale_y_continuous('Percent of Doctorates by Gender') +
  scale_x_continuous('', labels = NULL) +
  labs(title = 'Doctorates in Life Sciences in 1987') +
  theme_minimal()
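
The bar charts in this section use geom_bar() with stat = 'identity' so that the bar heights come from perc. ggplot2 also provides geom_col(), which uses the y values as bar heights directly. The sketch below compares the two on a made-up two-row table (the percentages are invented, not taken from gendphd).

```r
library(ggplot2)

# Made-up percentages standing in for one field in one year
toy <- data.frame(gend = c("female", "male"), perc = c(40, 60))

p_bar <- ggplot(toy) +
  geom_bar(aes(x = 1, y = perc, fill = gend), stat = "identity")

# geom_col() takes the bar heights from y without stat = "identity"
p_col <- ggplot(toy) +
  geom_col(aes(x = 1, y = perc, fill = gend))

# Compare where each stacked segment starts and ends
cols <- c("ymin", "ymax")
same_bars <- isTRUE(all.equal(layer_data(p_bar)[cols],
                              layer_data(p_col)[cols]))
same_bars
```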

With bars instead of a pie, we can show all the fields at once.

g <- gendphd %>% 
  filter(year == '1987') 
ggplot(g) +
  geom_bar(aes(x = label, y = perc, fill = gend), stat = 'identity') +
  scale_y_continuous('Percent of Doctorates by Gender') +
  labs(title = 'Doctorates in 1987') +
  theme_minimal()

ggplot(g) +
  geom_bar(aes(x = label, y = perc, fill = gend), stat = 'identity') +
  scale_y_continuous('Percent of Doctorates by Gender') +
  labs(title = 'Doctorates in 1987') +
  coord_flip() +
  theme_minimal()

Reorder it

g <- g %>% 
  mutate(label = fct_reorder(label, perc, .fun = min))
ggplot(g) +
  geom_bar(aes(x = label, y = perc, fill = gend), stat = 'identity') +
  scale_y_continuous('Percent of Doctorates by Gender') +
  labs(title = 'Doctorates in 1987') +
  coord_flip() +
  theme_minimal()

Using facets we can show all the years. Sorting by the overall minimum, as above, doesn’t work well here, so we need to be more specific about how to sort things: we order the fields by the percentage of women in 1987.

label_order <- gendphd %>% 
  filter(year == '1987', gend == 'female') %>% 
  arrange(perc) %>% 
  pull(label)
g <- gendphd %>% 
  mutate(label = factor(label, levels = label_order))
ggplot(g) +
  geom_bar(aes(x = label, y = perc, fill = gend), stat = 'identity') +
  scale_y_continuous('Percent of Doctorates by Gender') +
  labs(title = 'Doctorates by Field and Year') +
  coord_flip() +
  facet_wrap(~ year, nrow = 1) +
  theme_minimal()
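
One way the overall minimum can mislead is that it may come from either gender. The sketch below uses a made-up table (field names borrowed from this section, percentages invented) to contrast the two orderings: fct_reorder() with .fun = min, and explicit levels taken from the female rows only.

```r
library(dplyr)
library(forcats)

# Made-up data: three fields, two genders, percentages summing to 100
toy <- tibble(
  label = rep(c("Engineering", "Education", "Life sciences"), times = 2),
  gend  = rep(c("female", "male"), each = 3),
  perc  = c(10, 70, 35, 90, 30, 65)
)

# Ordering by the overall minimum mixes genders: for Education the
# minimum (30) is the male value, not the female one
by_min <- levels(fct_reorder(toy$label, toy$perc, .fun = min))

# Ordering by the female percentage only
by_female <- toy %>%
  filter(gend == "female") %>%
  arrange(perc) %>%
  pull(label)

by_min     # "Engineering" "Education" "Life sciences"
by_female  # "Engineering" "Life sciences" "Education"
```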

Task 7

Can you facet by the field and sort by year instead? What would that chart look like?

g2 <- gendphd %>%
  mutate(year = factor(year, levels = sort(unique(year)))) %>%
  arrange(field, year)

ggplot(g2) +
  geom_bar(aes(x = year, y = perc, fill = gend), stat = 'identity') +
  scale_y_continuous('Percent of Doctorates by Gender') +
  labs(title = 'Distribution of Doctorates by Field and Year') +
  coord_flip() +
  facet_wrap(~ field, scales = "free_y") +
  theme_minimal()

ggplot(g) +
  geom_bar(aes(x = year, y = perc, fill = gend), stat = 'identity') +
  scale_y_continuous('Percent of Doctorates by Gender') +
  labs(title = 'Consistent Increase in Women PhDs',
       subtitle = 'The proportion of women awarded doctorates within each field by year from 1987-2017', x = '') +
  scale_fill_manual("Gender", values = c("#DF8C95", "#532A31")) +
  coord_flip() +
  facet_wrap(~ label_wrap, nrow = 1) +
  theme_minimal() +
  theme(legend.position = 'bottom')

Economist Data

load('economist_data.rda')

Dot plots and bar plots…

What happens here if you flip the x and y?

ggplot(corbyn, aes(x = political_group, y = avg_facebook_likes)) +
  geom_bar(stat = 'identity')

You can do it manually by swapping x and y in aes(), or use coord_flip() to do it for you.

ggplot(corbyn, aes(x = political_group, y = avg_facebook_likes)) +
  geom_bar(stat = 'identity') +
  coord_flip()
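
For the manual route, a minimal sketch is shown below. It assumes ggplot2 3.3.0 or later, where bars can be drawn horizontally when the categorical variable is on y. The data frame is a made-up stand-in for corbyn: the column names match, but the group names and values are invented.

```r
library(ggplot2)

# Made-up stand-in for the corbyn data (same column names, invented values)
toy <- data.frame(
  political_group = c("Group A", "Group B", "Group C"),
  avg_facebook_likes = c(5000, 2000, 500)
)

# Horizontal bars by putting the category on y, with no coord_flip()
p_swapped <- ggplot(toy, aes(x = avg_facebook_likes, y = political_group)) +
  geom_bar(stat = "identity")

# ggplot2 records whether it treated the bars as horizontal
flipped <- unique(layer_data(p_swapped)$flipped_aes)
flipped
```

With this approach the axis scales keep their natural names (scale_x_continuous() applies to the likes, scale_y_discrete() to the groups), which can be easier to follow than remembering which axis coord_flip() has moved.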

Sort them

corbyn <- corbyn %>% 
  mutate(political_group = fct_reorder(political_group, avg_facebook_likes))
ggplot(corbyn, aes(x = political_group, y = avg_facebook_likes)) +
  geom_bar(stat = 'identity') +
  coord_flip()

Dot plot:

ggplot(corbyn, aes(x = political_group, y = avg_facebook_likes)) +
  geom_point() +
  coord_flip()

Label it directly:

ggplot(corbyn, aes(x = political_group, y = avg_facebook_likes, label = political_group)) +
  geom_point() +
  geom_text(hjust = -0.1) +
  coord_flip()

Remove the labels on the scale, and remove the label for the axis.

ggplot(corbyn, aes(x = political_group, y = avg_facebook_likes, label = political_group)) +
  geom_point() +
  geom_text(hjust = -0.1) +
  scale_x_discrete('', labels = NULL) +
  coord_flip()

ggplot(corbyn, aes(x = political_group, y = avg_facebook_likes, label = political_group)) +
  geom_point() +
  geom_text(hjust = -0.1) +
  scale_x_discrete('', labels = NULL) +
  scale_y_continuous('Average Facebook Likes', limits = c(0, 6000)) +
  coord_flip()